-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move array ArrayAgg
to a UserDefinedAggregate
#11448
Conversation
c7f10e1
to
a38089a
Compare
This follows how it done for input_type and only provide a single value. But might need to be changed into a Vec in the future. This is need when we are moving `arrag_agg` to udaf where one of the states nullability will depend on the nullability of the input.
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
7088d00
to
9d17c1c
Compare
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test for distinct array agg is already coverdd in
datafusion/datafusion/sqllogictest/test_files/aggregate.slt
Lines 173 to 192 in d67b0fb
# test array_agg_distinct with list data type | |
statement ok | |
CREATE TABLE array_agg_distinct_list_table AS VALUES | |
('w', [0,1]), | |
('w', [0,1]), | |
('w', [1,0]), | |
('b', [1,0]), | |
('b', [1,0]), | |
('b', [1,0]), | |
('b', [0,1]) | |
; | |
# Apply array_sort to have deterministic result, higher dimension nested array also works but not for array sort, | |
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` | |
query ?? | |
select array_sort(c1), array_sort(c2) from ( | |
select array_agg(distinct column1) as c1, array_agg(distinct column2) as c2 from array_agg_distinct_list_table | |
); | |
---- | |
[b, w] [[0, 1], [1, 0]] |
Remove the test here doesn't reduce test coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(though it looks like the comments should be adjusted):
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` |
@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> { | |||
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")), | |||
bool_and(lit(true)), | |||
bool_or(lit(true)), | |||
array_agg(lit(1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add array_agg_distinct
since it is not exist, and we can get distinct one with builder mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is builder mode ? You mean a builder for array_agg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jayzhan211 means that there is a single array_agg
udf function now, and internally it has special accumulators for distinct and non distinct
Thus there is no difference in how to serialize it I think
Maybe we could use the builder to test
array_agg(lit(1)).distinct().build().unwrap(),
🤔
Signed-off-by: jayzhan211 <[email protected]>
d7fd244
to
ac640fb
Compare
Signed-off-by: jayzhan211 <[email protected]>
@@ -63,160 +61,3 @@ pub use negative::{negative, NegativeExpr}; | |||
pub use no_op::NoOp; | |||
pub use not::{not, NotExpr}; | |||
pub use try_cast::{try_cast, TryCastExpr}; | |||
|
|||
#[cfg(test)] | |||
pub(crate) mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no usage
|
||
// TODO: change name to lowercase | ||
fn name(&self) -> &str { | ||
"ARRAY_AGG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason for having this in uppercase ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly it is inconsistent with count
which uses lower case:
datafusion/datafusion/functions-aggregate/src/count.rs
Lines 112 to 113 in b685e2d
fn name(&self) -> &str { | |
"count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why they are upper case initially, however we plan to convert them all to lowercase
if values.is_empty() { | ||
return Ok(()); | ||
} | ||
assert!(values.len() == 1, "array_agg can only take 1 param!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will invoke the panic!
, should we return an Err here ?
if states.is_empty() { | ||
return Ok(()); | ||
} | ||
assert!(states.len() == 1, "array_agg states must be singleton!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Can we remove all assert!
and return Err
@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> { | |||
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")), | |||
bool_and(lit(true)), | |||
bool_or(lit(true)), | |||
array_agg(lit(1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is builder mode ? You mean a builder for array_agg
distinct = true; | ||
protobuf::AggregateFunction::ArrayAgg | ||
} else if aggr_expr.downcast_ref::<OrderSensitiveArrayAgg>().is_some() { | ||
// TODO: remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think it can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OrderSensitiveArrayAgg is not yet removed
fn default() -> Self { | ||
Self { | ||
signature: Signature::any(1, Volatility::Immutable), | ||
alias: vec!["array_agg".to_string()], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using the smallcase as alias ? Shouldn't this be the function name itself ?
ArrayAgg, | ||
array_agg, | ||
expression, | ||
"input values, including nulls, concatenated into an array", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not clear & can we improve for clarity just the other functions ?
datafusion/expr/src/expr_fn.rs
Outdated
@@ -171,6 +171,7 @@ pub fn max(expr: Expr) -> Expr { | |||
)) | |||
} | |||
|
|||
// TODO: remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can remove this
Ok(()) | ||
} | ||
|
||
// Append value like ListArray(Int64Array(1,2,3), Int64Array(4,5,6)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If this is code document, then we can move this inside the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 and @dharanad for the review
I think this PR is good enough to merge, though I think addressing @dharanad 's comments beforehand would make it even better.
We are so close to completing user defined aggregate migration. so close!
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! Defines physical expressions that can evaluated at runtime during query execution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! Defines physical expressions that can evaluated at runtime during query execution | |
//! `ARRAY_AGG` aggregate implementation: [`ArrayAgg`] |
|
||
// TODO: change name to lowercase | ||
fn name(&self) -> &str { | ||
"ARRAY_AGG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly it is inconsistent with count
which uses lower case:
datafusion/datafusion/functions-aggregate/src/count.rs
Lines 112 to 113 in b685e2d
fn name(&self) -> &str { | |
"count" |
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(though it looks like the comments should be adjusted):
# so they are covered in `datafusion/physical-expr/src/aggregate/array_agg_distinct.rs` |
@@ -702,6 +702,7 @@ async fn roundtrip_expr_api() -> Result<()> { | |||
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")), | |||
bool_and(lit(true)), | |||
bool_or(lit(true)), | |||
array_agg(lit(1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @jayzhan211 means that there is a single array_agg
udf function now, and internally it has special accumulators for distinct and non distinct
Thus there is no difference in how to serialize it I think
Maybe we could use the builder to test
array_agg(lit(1)).distinct().build().unwrap(),
🤔
Signed-off-by: jayzhan211 <[email protected]>
@@ -702,6 +702,8 @@ async fn roundtrip_expr_api() -> Result<()> { | |||
string_agg(col("a").cast_to(&DataType::Utf8, &schema)?, lit("|")), | |||
bool_and(lit(true)), | |||
bool_or(lit(true)), | |||
array_agg(lit(1)), | |||
array_agg(lit(1)).distinct().build().unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Add input_nullable to UDAF args StateField/AccumulatorArgs This follows how it done for input_type and only provide a single value. But might need to be changed into a Vec in the future. This is need when we are moving `arrag_agg` to udaf where one of the states nullability will depend on the nullability of the input. * Make ArragAgg (not ordered or distinct) into a UDAF * Add roundtrip_expr_api test case * Address PR comments * Propegate input nullability for aggregates * Remove from accumulator args * first draft Signed-off-by: jayzhan211 <[email protected]> * cleanup Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> * distinct Signed-off-by: jayzhan211 <[email protected]> * fix Signed-off-by: jayzhan211 <[email protected]> * address comment Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]> Co-authored-by: Emil Ejbyfeldt <[email protected]>
* Add input_nullable to UDAF args StateField/AccumulatorArgs This follows how it done for input_type and only provide a single value. But might need to be changed into a Vec in the future. This is need when we are moving `arrag_agg` to udaf where one of the states nullability will depend on the nullability of the input. * Make ArragAgg (not ordered or distinct) into a UDAF * Add roundtrip_expr_api test case * Address PR comments * Propegate input nullability for aggregates * Remove from accumulator args * first draft Signed-off-by: jayzhan211 <[email protected]> * cleanup Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> * distinct Signed-off-by: jayzhan211 <[email protected]> * fix Signed-off-by: jayzhan211 <[email protected]> * address comment Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]> Co-authored-by: Emil Ejbyfeldt <[email protected]>
Which issue does this PR close?
Closes #.
Part of #10999, #8708
TODO
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?